2.
J Pathol Inform ; 2: 18, 2011 Mar 31.
Article in English | MEDLINE | ID: mdl-21572506
3.
J Pathol Inform ; 2: 5, 2011 Jan 24.
Article in English | MEDLINE | ID: mdl-21383929

ABSTRACT

The day has not arrived when pathology departments freely distribute their collected anatomic and clinical data for research purposes. Nonetheless, several valuable public domain data sets are currently available from the U.S. government. Two public data sets of special interest to pathologists are the SEER (the U.S. National Cancer Institute's Surveillance, Epidemiology and End Results program) public use data files and the CDC (Centers for Disease Control and Prevention) mortality files. The SEER files contain about 4 million de-identified cancer records, dating from 1973. The CDC mortality files contain approximately 85 million de-identified death records, dating from 1968. This editorial briefly describes both data sources, how they can be obtained, and how they may be used for pathology research.

4.
J Pathol Inform ; 1: 2010 Jul 13.
Article in English | MEDLINE | ID: mdl-20805954

ABSTRACT

BACKGROUND: Tissue microarrays (TMAs) are enormously useful tools for translational research, but incompatibilities in database systems between various researchers and institutions prevent the efficient sharing of data that could help realize their full potential. Resource Description Framework (RDF) provides a flexible method to represent knowledge in triples, which take the form Subject-Predicate-Object. All data resources are described using Uniform Resource Identifiers (URIs), which are global in scope. We present an OWL (Web Ontology Language) schema that expands upon the TMA data exchange specification to address this issue and assist in data sharing and integration. METHODS: A minimal OWL schema was designed containing only concepts specific to TMA experiments. More general data elements were incorporated from predefined ontologies such as the NCI thesaurus. URIs were assigned using the Linked Data format. RESULTS: We present examples of files utilizing the schema and conversion of XML data (similar to the TMA DES) to OWL. CONCLUSION: By utilizing predefined ontologies and global unique identifiers, this OWL schema provides a solution to the limitations of XML, which represents concepts defined in a localized setting. This will help increase the utilization of tissue resources, facilitating collaborative translational research efforts.
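The Subject-Predicate-Object model described above can be sketched in a few lines. The following minimal Python illustration uses invented namespaces, property names, and values — they are stand-ins, not the URIs of the published TMA/OWL schema:

```python
# A triple is simply (subject, predicate, object); subjects and predicates
# are URIs, objects are URIs or literal values. All URIs below are
# hypothetical illustrations, not the actual schema's identifiers.
EX = "http://example.org/tma#"              # invented namespace
NCI = "http://example.org/nci-thesaurus#"   # stand-in for NCI Thesaurus URIs

triples = [
    (EX + "core42", EX + "partOfArray", EX + "array7"),
    (EX + "core42", EX + "diagnosis",   NCI + "ProstateAdenocarcinoma"),
    (EX + "core42", EX + "stainResult", "PSA positive"),  # literal object
]

def objects_of(subject, predicate, graph):
    """Return all objects for a given subject/predicate pair."""
    return [o for s, p, o in graph if s == subject and p == predicate]

print(objects_of(EX + "core42", EX + "diagnosis", triples))
```

Because every resource is named with a globally scoped URI, triples produced by different institutions can be concatenated into one graph without the identifier collisions that localized XML element names invite.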

5.
Nat Biotechnol ; 26(3): 305-12, 2008 Mar.
Article in English | MEDLINE | ID: mdl-18327244

ABSTRACT

One purpose of the biomedical literature is to report results in sufficient detail that the methods of data collection and analysis can be independently replicated and verified. Here we present reporting guidelines for gene expression localization experiments: the minimum information specification for in situ hybridization and immunohistochemistry experiments (MISFISHIE). MISFISHIE is modeled after the Minimum Information About a Microarray Experiment (MIAME) specification for microarray experiments. Both guidelines define what information should be reported without dictating a format for encoding that information. MISFISHIE describes six types of information to be provided for each experiment: experimental design, biomaterials and treatments, reporters, staining, imaging data and image characterizations. This specification has benefited the consortium within which it was developed and is expected to benefit the wider research community. We welcome feedback from the scientific community to help improve our proposal.


Subjects
Immunohistochemistry/standards , In Situ Hybridization/standards , Computational Biology/methods , Computational Biology/standards , Gene Expression Profiling/methods , Gene Expression Profiling/standards , Immunohistochemistry/methods , In Situ Hybridization/methods
6.
BMC Cancer ; 7: 37, 2007 Feb 28.
Article in English | MEDLINE | ID: mdl-17386082

ABSTRACT

BACKGROUND: The Shared Pathology Informatics Network (SPIN) is a tissue resource initiative that utilizes clinical reports of the vast numbers of paraffin-embedded tissues routinely stored by medical centers. SPIN has an informatics component (sending tissue-related queries to multiple institutions via the internet) and a service component (providing histopathologically annotated tissue specimens for medical research). This paper examines whether tissue blocks, identified by localized computer searches at participating institutions, can be retrieved in adequate quantity and quality to support medical researchers. METHODS: Four centers evaluated pathology reports (1990-2005) for common and rare tumors to determine the percentage of cases in which suitable tissue blocks with tumor were available. Each site generated a list of 100 common tumor cases (25 cases each of breast adenocarcinoma, colonic adenocarcinoma, lung squamous carcinoma, and prostate adenocarcinoma) and 100 rare tumor cases (25 cases each of adrenal cortical carcinoma, gastrointestinal stromal tumor [GIST], adenoid cystic carcinoma, and mycosis fungoides) using a combination of Tumor Registry, laboratory information system (LIS), and/or SPIN-related tools. Pathologists identified the slides/blocks with tumor and noted the first 3 slides with the largest tumor and the availability of the corresponding block. RESULTS: For common tumor cases (n = 400), institutional retrieval rates (all blocks) were 83% (A), 95% (B), 80% (C), and 98% (D). The retrieval rate (tumor blocks) from all centers for common tumors was 73%, with a mean largest tumor size of 1.49 cm; retrieval (tumor blocks) was highest for lung (84%) and lowest for prostate (54%). For rare tumor cases (n = 400), institutional retrieval rates (all blocks) were 78% (A), 73% (B), 67% (C), and 84% (D).
The retrieval rate (tumor blocks) from all centers for rare tumors was 66%, with a mean largest tumor size of 1.56 cm; retrieval (tumor blocks) was highest for GIST (72%) and lowest for adenoid cystic carcinoma (58%). CONCLUSION: This assessment shows that archival tissue blocks, and the electronic data associated with them, are retrievable in a quantity and quality of value to researchers. This study serves to complement the data from which uniform use of the SPIN query tools by all four centers will be measured, to assure and highlight the usefulness of archival material for obtaining tumor tissues for research.


Subjects
Paraffin Embedding/statistics & numerical data , Pathology, Clinical/organization & administration , Tissue Banks/statistics & numerical data , Humans , Medical Informatics/organization & administration , Neoplasms/pathology , United States
7.
Cancer Detect Prev ; 30(5): 387-94, 2006.
Article in English | MEDLINE | ID: mdl-17079091

ABSTRACT

BACKGROUND: Precancers are lesions that precede the appearance of invasive cancers. The successful prevention or treatment of precancers has the potential to eliminate deaths due to cancer. METHODS: A National Cancer Institute-sponsored Conference on Precancer was convened on November 8-9, 2004, at The George Washington University Medical Center, Washington, DC. A definition of precancers was developed over 2 days of Conference discussions. RESULTS: The following five criteria define a precancer: (1) evidence must exist that the precancer is associated with an increased risk of cancer; (2) when a precancer progresses to cancer, the resulting cancer arises from cells within the precancer; (3) a precancer differs from the normal tissue from which it arises; (4) a precancer differs from the cancer into which it develops, although it has some, but not all, of the molecular and phenotypic properties that characterize the cancer; (5) there is a method by which the precancer can be diagnosed. CONCLUSIONS: The Conference participants developed a general definition for precancers that would provide a consistent and clinically useful way of distinguishing precancers from all other types of lesions. It was recognized that many precancerous lesions may not meet this strict definition, but the group felt it was necessary to define criteria that will help standardize clinical and biological studies. Furthermore, a set of defining criteria for putative precancer lesions will permit pathologists to build a diagnostically useful taxonomy of precancers based on specified clinical and biological properties. Precancers thus characterized can be classified into clinically relevant sub-groups based on shared properties (e.g., biomarkers, oncogenes, common metabolic pathways, responses to therapy). Publications that introduce newly described precancer entities should describe how each of the five defining criteria applies.
This manuscript reviews the proposed definition of precancers and suggests how pathologists, oncologists and cancer researchers may determine when these criteria are satisfied.


Subjects
Neoplasms/pathology , Precancerous Conditions/pathology , Humans , National Institutes of Health (U.S.) , United States
8.
BMC Cancer ; 6: 120, 2006 May 05.
Article in English | MEDLINE | ID: mdl-16677389

ABSTRACT

BACKGROUND: Advances in molecular biology and growing requirements from biomarker validation studies have generated a need for tissue banks to provide quality-controlled tissue samples with standardized clinical annotation. The NCI Cooperative Prostate Cancer Tissue Resource (CPCTR) is a distributed tissue bank that comprises four academic centers and provides thousands of clinically annotated prostate cancer specimens to researchers. Here we describe the CPCTR information management system architecture, common data element (CDE) development, query interfaces, data curation, and quality control. METHODS: Data managers review the medical records to collect and continuously update information for the 145 clinical, pathological, and inventorial CDEs that the Resource maintains for each case. An Access-based data entry tool provides de-identification and a standard communication mechanism between each group and a central CPCTR database. Standardized automated quality control audits have been implemented. Centrally, an Oracle database has web interfaces allowing multiple user types, including the general public, to mine de-identified information from all of the sites with three levels of specificity and granularity, as well as to request tissues through a formal letter of intent. RESULTS: Since July 2003, CPCTR has offered over 6,000 cases (38,000 blocks) of highly characterized prostate cancer biospecimens, including several tissue microarrays (TMA). The Resource developed a website with interfaces for the general public as well as researchers and internal members. These user groups have used the web tools to query summary data on available cases, to prepare requests, and to receive tissues. As of December 2005, the Resource had received over 130 tissue requests, of which 45 have been reviewed, approved, and filled.
Additionally, the Resource implemented the TMA Data Exchange Specification in its TMA program and created a computer program for calculating PSA recurrence. CONCLUSION: Building a biorepository infrastructure that meets today's research needs involves time and the input of many individuals from diverse disciplines. The CPCTR can provide large volumes of carefully annotated prostate tissue for research initiatives such as Specialized Programs of Research Excellence (SPOREs) and for biomarker validation studies, and its experience can guide the development of collaborative, large-scale virtual tissue banks in other organ systems.


Subjects
Information Management , Medical Informatics Applications , Prostatic Neoplasms/pathology , Tissue Banks , Databases as Topic , Gene Expression Profiling , Gene Expression Regulation, Neoplastic , Humans , Information Management/standards , Internet , Male , Marketing , Medical Records , Prostatic Neoplasms/genetics , Prostatic Neoplasms/metabolism , Quality Control , Tissue Banks/standards
9.
BMC Med Inform Decis Mak ; 5: 35, 2005 Oct 18.
Article in English | MEDLINE | ID: mdl-16232314

ABSTRACT

BACKGROUND: New terminology continuously enters the biomedical literature. How can curators identify new terms that can be added to existing nomenclatures? The most direct method, and one that has served well, involves reading the current literature. The scholarly curator adds new terms as they are encountered. Present-day scholars are severely challenged by the enormous volume of biomedical literature. Curators of medical nomenclatures need computational assistance if they hope to keep their terminologies current. The purpose of this paper is to describe a method of rapidly extracting new, candidate terms from huge volumes of biomedical text. The resulting lists of terms can be quickly reviewed by curators and added to nomenclatures, if appropriate. The candidate term extractor uses a variation of the previously described doublet coding method. The algorithm, which operates on virtually any nomenclature, derives from the observation that most terms within a knowledge domain are composed entirely of word combinations found in other terms from the same knowledge domain. Terms can be expressed as sequences of overlapping word doublets that have more specific meaning than the individual words that compose the term. The algorithm parses through text, finding contiguous sequences of word doublets that are known to occur somewhere in the reference nomenclature. When a sequence of matching word doublets is encountered, it is compared with whole terms already included in the nomenclature. If the doublet sequence is not already in the nomenclature, it is extracted as a candidate new term. Candidate new terms can be reviewed by a curator to determine if they should be added to the nomenclature. An implementation of the algorithm is demonstrated, using a corpus of published abstracts obtained through the National Library of Medicine's PubMed query service and using "The developmental lineage classification and taxonomy of neoplasms" as a reference nomenclature. 
RESULTS: A 31+ Megabyte corpus of pathology journal abstracts was parsed using the doublet extraction method. This corpus consisted of 4,289 records, each containing an abstract title. The total number of words included in the abstract titles was 50,547. New candidate terms for the nomenclature were automatically extracted from the titles of abstracts in the corpus. Total execution time on a desktop computer with a CPU speed of 2.79 GHz was 2 seconds. The resulting output consisted of 313 new candidate terms, each consisting of concatenated doublets found in the reference nomenclature. Human review of the 313 candidate terms yielded a list of 285 terms approved by a curator. Automatic removal of duplicate terms yielded a final list of 222 new terms (71% of the original 313 extracted candidate terms) that could be added to the reference nomenclature. CONCLUSION: The doublet method for automatically extracting candidate nomenclature terms can be used to quickly find new terms from vast amounts of text. The method can be immediately adapted for virtually any text and any nomenclature. An implementation of the algorithm, in the Perl programming language, is provided with this article.
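The extraction step described above can be sketched as follows. The original implementation is a Perl script; the two-term toy nomenclature, function names, and sample sentences below are illustrative assumptions only:

```python
def doublets(words):
    """Overlapping word doublets: ['a','b','c'] -> [('a','b'), ('b','c')]."""
    return [tuple(words[i:i + 2]) for i in range(len(words) - 1)]

# Tiny stand-in for the reference nomenclature (the real one is a
# neoplasm taxonomy with thousands of terms).
nomenclature = {"squamous cell carcinoma", "carcinoma in situ"}
known_doublets = {d for term in nomenclature for d in doublets(term.split())}

def candidate_terms(text):
    """Extract maximal runs of known doublets that are not yet whole terms."""
    words = text.lower().split()
    candidates, run = [], []
    for pair in doublets(words):
        if pair in known_doublets:
            # extend the current run of contiguous matching doublets
            run = run + [pair[1]] if run else list(pair)
        else:
            if run:
                candidates.append(" ".join(run))
            run = []
    if run:
        candidates.append(" ".join(run))
    # keep only phrases not already present in the nomenclature
    return [c for c in candidates if c not in nomenclature]

print(candidate_terms("squamous cell carcinoma in situ of the cervix"))
```

Here the run of known doublets spans two existing terms, so the phrase "squamous cell carcinoma in situ" emerges as a new candidate, while text composed solely of an existing term yields nothing — mirroring the filtering described in the abstract.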


Subjects
Electronic Data Processing/methods , Information Storage and Retrieval , Medical Informatics Computing , Terminology as Topic , Abstracting and Indexing , Algorithms , Humans , Medical Subject Headings , National Library of Medicine (U.S.) , Neoplasms/classification , PubMed , Semantics , Systems Integration , United States
10.
BMC Cancer ; 5: 108, 2005 Aug 21.
Article in English | MEDLINE | ID: mdl-16111498

ABSTRACT

BACKGROUND: The Cooperative Prostate Cancer Tissue Resource (CPCTR) is a consortium of four geographically dispersed institutions that are funded by the U.S. National Cancer Institute (NCI) to provide clinically annotated prostate cancer tissue samples to researchers. To facilitate this effort, it was critical to arrive at agreed upon common data elements (CDEs) that could be used to collect demographic, pathologic, treatment and clinical outcome data. METHODS: The CPCTR investigators convened a CDE curation subcommittee to develop and implement CDEs for the annotation of collected prostate tissues. The draft CDEs were refined and progressively annotated to make them ISO 11179 compliant. The CDEs were implemented in the CPCTR database and tested using software query tools developed by the investigators. RESULTS: By collaborative consensus the CPCTR CDE subcommittee developed 145 data elements to annotate the tissue samples collected. These included for each case: 1) demographic data, 2) clinical history, 3) pathology specimen level elements to describe the staging, grading and other characteristics of individual surgical pathology cases, 4) tissue block level annotation critical to managing a virtual inventory of cases and facilitating case selection, and 5) clinical outcome data including treatment, recurrence and vital status. These elements have been used successfully to respond to over 60 requests by end-users for tissue, including paraffin blocks from cases with 5 to 10 years of follow up, tissue microarrays (TMAs), as well as frozen tissue collected prospectively for genomic profiling and genetic studies. The CPCTR CDEs have been fully implemented in two major tissue banks and have been shared with dozens of other tissue banking efforts. CONCLUSION: The freely available CDEs developed by the CPCTR are robust, based on "best practices" for tissue resources, and are ISO 11179 compliant. 
The process for CDE development described in this manuscript provides a framework model for other organ sites and has been used as a model for breast and melanoma tissue banking efforts.


Subjects
Computational Biology/methods , Databases as Topic , Prostatic Neoplasms/pathology , Tissue Banks , Computers , Humans , Male , Prostatic Neoplasms/metabolism , Recurrence , Software , Treatment Outcome
11.
In Silico Biol ; 5(3): 313-22, 2005.
Article in English | MEDLINE | ID: mdl-15984939

ABSTRACT

Assigning nomenclature codes to biomedical data is an arduous, expensive and error-prone task. Data records are coded to provide a common representation of contained concepts, allowing facile retrieval of records via a standard terminology. In the medical field, cancer registrars, nurses, pathologists, and private clinicians all understand the importance of annotating medical records with vocabularies that codify the names of diseases, procedures, billing categories, etc. Molecular biologists need codified medical records so that they can discover or validate relationships between experimental data and clinical data. This paper introduces a new approach to retrieving data records without prior coding. The approach achieves the same result as a search over pre-coded records. It retrieves all records that contain any terms that are synonymous with a user's query term. A recently described fast algorithm (the doublet method) permits quick iterative searches over every synonym for any term from any nomenclature occurring in a dataset of any size. As a demonstration, a 105+ Megabyte corpus of PubMed abstracts was searched for medical terms. Query terms were matched against either of two vocabularies and expanded as an array of equivalent search items. A single search term may have over one hundred nomenclature synonyms, all of which were searched against the full database. Iterative searches of a list of concept-equivalent terms involve many more operations than a single search over pre-annotated concept codes. Nonetheless, the doublet method achieved fast query response times (0.05 seconds using SNOMED and 5 seconds using the Developmental Lineage Classification of Neoplasms, on a computer with a 2.89 GHz processor). Pre-annotated datasets lose their value when the chosen vocabulary is replaced by a different vocabulary or by a different version of the same vocabulary. The doublet method can employ any version of any vocabulary with no pre-annotation.
In many instances, the enormous effort and expense associated with data annotation can be eliminated by on-the-fly doublet matching. The algorithm for nomenclature-based database searches using the doublet method is described. Perl scripts for implementing the algorithm and testing execution speed are provided as open source documents available from the Association for Pathology Informatics (www.pathologyinformatics.org/informatics_r.htm).
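The retrieval strategy above — expanding a query term into every synonym of its concept and searching the raw text, instead of pre-coding records — might be sketched like this. The concept code, synonym list, and sample records are invented stand-ins for a real vocabulary such as SNOMED:

```python
# Hypothetical concept table: one concept code mapped to its synonyms.
synonyms_by_code = {
    "C1234": ["renal cell carcinoma", "hypernephroma", "grawitz tumor"],
}
# Reverse index: any synonym leads back to its concept code.
code_by_term = {t: c for c, ts in synonyms_by_code.items() for t in ts}

def search(query_term, records):
    """Return records containing ANY synonym of the query term's concept."""
    code = code_by_term[query_term.lower()]
    terms = synonyms_by_code[code]
    return [r for r in records if any(t in r.lower() for t in terms)]

records = [
    "Biopsy shows classic hypernephroma of the left kidney.",
    "No evidence of malignancy.",
]
print(search("renal cell carcinoma", records))
```

The first record is retrieved even though it never contains the literal query string — the synonym expansion does the work that concept pre-annotation would otherwise have to do, which is why swapping in a new vocabulary version requires no re-coding of the data.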


Subjects
Information Storage and Retrieval , Medical Informatics , Terminology as Topic , Abstracting and Indexing , Algorithms , Systems Integration
12.
Expert Rev Mol Diagn ; 5(3): 329-36, 2005 May.
Article in English | MEDLINE | ID: mdl-15934811

ABSTRACT

Data integration occurs when a query proceeds through multiple data sets, thereby relating diverse data extracted from different data sources. Data integration is particularly important to biomedical researchers since data obtained from experiments on human tissue specimens have little applied value unless they can be combined with medical data (i.e., pathologic and clinical information). In the past, research data were correlated with medical data by manually retrieving, reading, assembling and abstracting patient charts, pathology reports, radiology reports and the results of special tests and procedures. Manual annotation of research data is impractical when experiments involve hundreds or thousands of tissue specimens resulting in large, complex data collections. The purpose of this paper is to review how XML (eXtensible Markup Language) provides the fundamental tools that support biomedical data integration. The article also discusses some of the most important challenges that block the widespread availability of annotated biomedical data sets.


Subjects
Data Collection/standards , Humans , Internet , Logic , Medical Informatics , Medical Records , Research
13.
J Urol ; 173(5): 1546-51, 2005 May.
Article in English | MEDLINE | ID: mdl-15821483

ABSTRACT

PURPOSE: Prostate cancer can occur in patients with low screening serum prostate specific antigen (PSA) values (less than 4.0 ng/ml). It is currently unclear whether these tumors are different from prostate cancer in patients with high PSA levels (greater than 4.0 ng/ml). MATERIALS AND METHODS: From the Cooperative Prostate Cancer Tissue Resource database through March 2004, 3,416 patients with screening PSA less than 16.0 ng/ml diagnosed with prostate cancer between 1993 and 2004 were stratified into groups based on screening serum PSA. These subsets were compared for race, age at diagnosis, clinical and pathological stage, Gleason score, positive surgical margins, posttreatment recurrent disease, and vital status. RESULTS: We identified 468 (14%) patients with screening PSA less than 4.0 ng/ml, 142 (4.2%) of whom had a PSA of less than 2.0 ng/ml. This group included 40 black and 376 white patients. Men with low screening PSA treated with radical prostatectomy had smaller cancers, lower Gleason scores, lower pathological tumor (T) stage, and lower PSA recurrence rates than men with high PSA levels (4 ng/ml or greater). These differences held true for men who were younger than 62 years or were white, whereas older or black men had tumor characteristics and outcomes similar to those with higher PSA levels. CONCLUSIONS: Young (younger than 62 years) or white patients with screening serum PSA less than 4.0 ng/ml had smaller, lower grade tumors and lower recurrence rates than patients with PSA 4.0 ng/ml or greater. This was not true for those older than 62 years and for black men.


Subjects
Prostate-Specific Antigen/blood , Prostatic Neoplasms/blood , Humans , Male , Middle Aged , Prostatic Neoplasms/pathology
14.
Hum Pathol ; 36(2): 139-45, 2005 Feb.
Article in English | MEDLINE | ID: mdl-15754290

ABSTRACT

It is impossible to overstate the importance of XML (eXtensible Markup Language) as a data organization tool. With XML, pathologists can annotate all of their data (clinical and anatomic) in a format that can transform every pathology report into a database, without compromising narrative structure. The purpose of this manuscript is to provide an overview of XML for pathologists. Examples will demonstrate how pathologists can use XML to annotate individual data elements and to structure reports in a common format that can be merged with other XML files or queried using standard XML tools. This manuscript gives pathologists a glimpse into how XML allows pathology data to be linked to other types of biomedical data and reduces our dependence on centralized proprietary databases.
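As a rough illustration of the idea — the tag names and the code attribute below are invented for this example, not a published pathology schema — a report whose data elements are wrapped in XML tags can be queried with standard XML tools:

```python
import xml.etree.ElementTree as ET

# A toy, XML-annotated pathology report fragment. The element and
# attribute names are hypothetical illustrations only.
report = """
<report>
  <specimen>prostate, needle biopsy</specimen>
  <diagnosis code="D-001">adenocarcinoma</diagnosis>
  <gleason_score>7</gleason_score>
</report>
"""

root = ET.fromstring(report)
# Once annotated, the narrative report behaves like a small database:
print(root.findtext("diagnosis"))           # adenocarcinoma
print(root.find("diagnosis").get("code"))   # D-001
print(root.findtext("gleason_score"))       # 7
```

Because every element is explicitly named, reports from different laboratories can be merged into one file and queried uniformly, without a centralized proprietary database — the point the abstract makes.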


Subjects
Database Management Systems , Databases as Topic/organization & administration , Medical Informatics/methods , Pathology/methods , Programming Languages , Terminology as Topic , Databases as Topic/standards , Humans
15.
AMIA Annu Symp Proc ; : 515-9, 2005.
Article in English | MEDLINE | ID: mdl-16779093

ABSTRACT

The Shared Pathology Informatics Network (SPIN), a research initiative of the National Cancer Institute, will allow for the retrieval of more than 4 million pathology reports and specimens. In this paper, we describe the special query tool developed for the Indianapolis/Regenstrief SPIN node, integrated into the ever-expanding Indiana Network for Patient Care (INPC). This query tool allows for the retrieval of de-identified data sets using complex logic and auto-coded final diagnoses, and intrinsically supports multiple types of statistical analyses. The new SPIN/INPC database represents a new generation of the Regenstrief Medical Record System--a centralized but federated system of repositories.


Subjects
Confidentiality , Database Management Systems , Databases as Topic , Information Storage and Retrieval/methods , Pathology , Hospital Information Systems , Humans , Logical Observation Identifiers Names and Codes , Medical Records Systems, Computerized , User-Computer Interface
16.
BMC Cancer ; 4: 88, 2004 Nov 30.
Article in English | MEDLINE | ID: mdl-15571625

ABSTRACT

BACKGROUND: The new "Developmental lineage classification of neoplasms" was described in a prior publication. The classification is simple (the entire hierarchy is described with just 39 classifiers), comprehensive (providing a place for every tumor of man), and consistent with recent attempts to characterize tumors by cytogenetic and molecular features. A taxonomy is a list of the instances that populate a classification. The taxonomy of neoplasia attempts to list every known term for every known tumor of man. METHODS: The taxonomy provides each concept with a unique code and groups synonymous terms under the same concept. A Perl script validated successive drafts of the taxonomy ensuring that: 1) each term occurs only once in the taxonomy; 2) each term occurs in only one tumor class; 3) each concept code occurs in one and only one hierarchical position in the classification; and 4) the file containing the classification and taxonomy is a well-formed XML (eXtensible Markup Language) document. RESULTS: The taxonomy currently contains 122,632 different terms encompassing 5,376 neoplasm concepts. Each concept has, on average, 23 synonyms. The taxonomy populates "The developmental lineage classification of neoplasms," and is available as an XML file, currently 9+ Megabytes in length. A representation of the classification/taxonomy listing each term followed by its code, followed by its full ancestry, is available as a flat-file, 19+ Megabytes in length. The taxonomy is the largest nomenclature of neoplasms, with more than twice the number of neoplasm names found in other medical nomenclatures, including the 2004 version of the Unified Medical Language System, the Systematized Nomenclature of Medicine Clinical Terminology, the National Cancer Institute's Thesaurus, and the International Classification of Diseases for Oncology.
CONCLUSIONS: This manuscript describes a comprehensive taxonomy of neoplasia that collects synonymous terms under a unique code number and assigns each tumor to a single class within the tumor hierarchy. The entire classification and taxonomy are available as open access files (in XML and flat-file formats) with this article.
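The validation checks listed in METHODS might be sketched as follows. The original validator is a Perl script; the toy taxonomy, its tag names, and its codes below are illustrative assumptions, not the published file format:

```python
import xml.etree.ElementTree as ET

# A miniature, invented taxonomy in the spirit of the one described.
taxonomy_xml = """
<classification>
  <class name="epithelial">
    <concept code="C1"><term>basal cell carcinoma</term><term>basalioma</term></concept>
  </class>
  <class name="mesenchymal">
    <concept code="C2"><term>leiomyoma</term></concept>
  </class>
</classification>
"""

# Check 4: the file parses as well-formed XML (fromstring raises otherwise).
root = ET.fromstring(taxonomy_xml)

terms, codes = [], []
for cls in root.findall("class"):
    for concept in cls.findall("concept"):
        codes.append(concept.get("code"))
        terms.extend(t.text for t in concept.findall("term"))

# Check 1 (which, in this flat walk, also enforces check 2: a term
# duplicated across classes would appear twice in the list).
assert len(terms) == len(set(terms)), "duplicate term"
# Check 3: each concept code occupies exactly one hierarchical position.
assert len(codes) == len(set(codes)), "duplicate concept code"
print("taxonomy valid:", len(terms), "terms,", len(codes), "concepts")
```

Running such checks on every draft keeps the invariant that each synonym maps to exactly one concept and each concept to exactly one place in the hierarchy.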


Subjects
Cell Lineage , Neoplasms/classification , Databases, Factual , Female , Humans , Male , Neoplastic Stem Cells/classification
18.
Stud Health Technol Inform ; 107(Pt 1): 663-7, 2004.
Article in English | MEDLINE | ID: mdl-15360896

ABSTRACT

We have developed a pipeline-based system for automated annotation of Surgical Pathology Reports with UMLS terms that builds on GATE--an open-source architecture for language engineering. The system includes a module for detecting and annotating negated concepts, which implements the NegEx algorithm--an algorithm originally described for use in discharge summaries and radiology reports. We describe the implementation of the system and an early evaluation of the Negation Tagger. Our results are encouraging. In the key Final Diagnosis section, with almost no modification of the algorithm or phrase lists, the system performs with a precision of 0.84 and a recall of 0.80 against a gold-standard corpus of negation annotations created by a panel of pathologists using a modified Delphi technique. Further work will focus on refining the Negation Tagger and UMLS Tagger and adding additional processing resources for annotating free-text pathology reports.
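A drastically simplified, NegEx-style check might look like the sketch below. Real NegEx uses curated trigger-phrase lists and scope termination rules; the trigger list and fixed-width scope window here are illustrative assumptions only:

```python
import re

# Invented sample of negation trigger phrases (NegEx's real lists are
# curated and much longer).
NEG_TRIGGERS = ["no evidence of", "negative for", "without", "no "]

def negated_concepts(sentence, concepts):
    """Return the subset of concepts that fall after a negation trigger."""
    s = sentence.lower()
    negated = set()
    for trig in NEG_TRIGGERS:
        for m in re.finditer(re.escape(trig), s):
            # Crude fixed-width scope window after the trigger; real
            # NegEx terminates scope at conjunctions and other cues.
            scope = s[m.end():m.end() + 60]
            for c in concepts:
                if c in scope:
                    negated.add(c)
    return negated

sent = "Sections show chronic inflammation; no evidence of malignancy."
print(negated_concepts(sent, ["malignancy", "inflammation"]))
```

Here "malignancy" is flagged as negated while "inflammation", which precedes the trigger, is not — the distinction a UMLS annotator must make before asserting a concept in a report.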


Subjects
Algorithms , Medical Records Systems, Computerized , Natural Language Processing , Pathology, Surgical , Clinical Laboratory Information Systems , Humans , Information Storage and Retrieval , Internet , Software , Specimen Handling , Unified Medical Language System
19.
BMC Med Inform Decis Mak ; 4: 16, 2004 Sep 15.
Article in English | MEDLINE | ID: mdl-15369595

ABSTRACT

BACKGROUND: Autocoding (or automatic concept indexing) occurs when a software program extracts terms contained within text and maps them to a standard list of concepts contained in a nomenclature. The purpose of autocoding is to provide a way of organizing large documents by the concepts represented in the text. Because textual data accumulates rapidly in biomedical institutions, the computational methods used to autocode text must be very fast. The purpose of this paper is to describe the doublet method, a new algorithm for very fast autocoding. METHODS: An autocoder was written that transforms plain-text into intercalated word doublets (e.g. "The ciliary body produces aqueous humor" becomes "The ciliary, ciliary body, body produces, produces aqueous, aqueous humor"). Each doublet is checked against an index of doublets extracted from a standard nomenclature. Matching doublets are assigned a numeric code specific for each doublet found in the nomenclature. Text doublets that do not match the index of doublets extracted from the nomenclature are not part of valid nomenclature terms. Runs of matching doublets from text are concatenated and matched against nomenclature terms (also represented as runs of doublets). RESULTS: The doublet autocoder was compared for speed and performance against a previously published phrase autocoder. Both autocoders are Perl scripts, and both autocoders used an identical text (a 170+ Megabyte collection of abstracts collected through a PubMed search) and the same nomenclature (neocl.xml, containing over 102,271 unique names of neoplasms). In side-by-side comparison on the same computer, the doublet method autocoder was 8.4 times faster than the phrase autocoder (211 seconds versus 1,776 seconds). The doublet method codes 0.8 Megabytes of text per second on a desktop computer with a 1.6 GHz processor. 
In addition, the doublet autocoder successfully matched terms that were missed by the phrase autocoder, while the phrase autocoder found no terms that were missed by the doublet autocoder. CONCLUSIONS: The doublet method of autocoding is a novel algorithm for rapid text autocoding. The method will work with any nomenclature and will parse any ASCII plain text. An implementation of the algorithm in Perl is provided with this article. The algorithm, the Perl implementation, the neoplasm nomenclature, and Perl itself are all open source materials.
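The doublet transformation can be sketched directly from the example in the abstract. The original autocoder is a Perl script; the two-term nomenclature and concept codes below are invented for illustration:

```python
def intercalated_doublets(text):
    """'The ciliary body ...' -> ['the ciliary', 'ciliary body', ...]."""
    w = text.lower().split()
    return [" ".join(w[i:i + 2]) for i in range(len(w) - 1)]

# Toy nomenclature mapping terms to invented concept codes. Each term
# here happens to be a single doublet, so the doublet index built from
# it is simply the set of lower-cased terms.
nomenclature = {"ciliary body": "C0009", "aqueous humor": "C0012"}
doublet_index = {d for term in nomenclature for d in intercalated_doublets(term)}

def autocode(text):
    """Map each text doublet found in the nomenclature to its code."""
    hits = [d for d in intercalated_doublets(text) if d in doublet_index]
    return {d: nomenclature[d] for d in hits if d in nomenclature}

print(intercalated_doublets("The ciliary body produces aqueous humor"))
print(autocode("The ciliary body produces aqueous humor"))
```

Doublets that never occur in the nomenclature ("body produces", "produces aqueous") are discarded immediately by a set lookup, which is where the method's speed advantage over whole-phrase matching comes from; multi-doublet terms would additionally require concatenating runs of matching doublets, as the abstract describes.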


Subjects
Algorithms , Electronic Data Processing/methods , Natural Language Processing , Neoplasms/classification , Terminology as Topic , Abstracting and Indexing , Computers , Humans , Software , Software Design , Unified Medical Language System